High dimensionality in data sets is one of the challenges faced in
classification, data mining, and sentiment analysis. A data set may contain
many dimensions that take effort to simplify, and these dimensions have a
major impact on the complexity and performance of the algorithms used for
classification. Further challenges include determining the optimal
combination of pre-processing techniques, cleaning the dataset, and choosing
the best classification algorithm. This study uses a new approach based on
the combination of three techniques: tokenizing-lowercasing-stemming (as the
pre-processing series), support vector machine (SVM) for supervised
classification, and fuzzy matching (FM) for dimensionality reduction. The
proposed model was realized using 3 different datasets, namely Amazon product
review, movie review, and airline review from Twitter. This study provides
better findings than the previous results. The improved performance is
generated by SVM combined with FM, resulting in 96% accuracy. The SVM-FM
combination can therefore be regarded as the best of the tested combinations
for sentiment analysis on the given datasets.
1. INTRODUCTION

Sentiment analysis is the task of detecting sentiment polarity in text, and it is widely
applied in e-commerce systems, blogs, and social media. Its main task is to group documents by
polarity. Based on automatic predictions, it assists business owners in making informed decisions and
planning directions to grow their business [1]. Sentiment analysis can be thought of as a text
classification problem, since the process includes several operations that end in classifying whether a given
text expresses a positive or negative sentiment [1]. It is estimated that 90% of the data to be analyzed is
still unstructured or unorganized. Unstructured business data is produced and retrieved every day in large
quantities in the form of emails, testimonials, chats, social media conversations, surveys,
articles, and documents. Such data makes it difficult to conduct sentiment analysis in a timely and
efficient manner.

The challenge in conducting sentiment analysis lies in several stages. Noise in text data remains
an active research topic. Several studies have stated the advantages of pre-processing
techniques. Tokenizing is recommended to get the best classification results [2]-[4]. In the study by
Resyanto et al. [5], stemming produced the highest accuracy. Stop word elimination and transform
case are also recommended to reduce noise in a group of texts [6]. In this study, the performance
of each pre-processing technique was compared across several classification algorithms
to determine the best performance.

High dimensionality in data sets is one of the challenges faced in classification, data mining, and
sentiment analysis. A data set may contain many dimensions that take effort to simplify, and
these dimensions have a major impact on the complexity and performance of the algorithms used for
classification. These challenges can be addressed through dimensionality reduction. Fuzzy
matching (FM) can be used to measure the dissimilarity between two strings, so that text matches can be
found even between misspelled or differently written words [3]. Text similarity can also be measured with
Euclidean distance similarity (EDS) [7]. Principal component analysis (PCA) is a transformation technique
that reduces the size of a data set containing a large number of interrelated dimensions, so that the
data can be expressed with a smaller number of variables [8]. However, some existing studies only solve
problems on one data source. Therefore, it is necessary to know whether optimal performance is also
achieved on other datasets.

Analyzing large amounts of data is an expensive and time-consuming process. It is therefore
vital to have the best model to produce the best classification performance. In this study, support vector
machine (SVM) is used for sentiment analysis classification. Prior to that, three algorithms are used to
reduce the dimensionality of the data, namely PCA, FM, and EDS. The proposed model was realized using 3
different datasets, namely Amazon product review, movie review, and airline review from Twitter, and
compared against several identified algorithms, which include naive Bayes (NB), k-nearest neighbor (KNN),
and deep learning (DL). The results of this study present the combination of pre-processing techniques,
dimensionality reduction algorithm, and classification algorithm that performs best. The combination is
chosen to produce the best accuracy, precision, recall, and f-measure values.

Much research on sentiment analysis has been carried out. Various challenges were
encountered, including how to determine the optimal combination of pre-processing techniques, how to clean
the dataset, and how to determine the best classification algorithm. The pre-processing techniques commonly
combined are tokenizing, stop word elimination, transform case, and stemming [1]. The tokenizing technique
helps to reduce the dimensions of the problem in a group of texts: cutting a group of sentences into chunks
of words makes the analysis simpler [1], [7], [9]-[11]. Capitalization variability in a dataset can
cause problems during classification and degrade performance. Converting capital letters to
lowercase is the most common way of dealing with this problem in text data. The lowercasing technique helps
avoid treating case variants of the same word as different words [5], [7], [9]-[15]. Stop words occur very
frequently in a group of texts, but they carry no meaning worth analyzing. Therefore, removing
stop words is a good way to simplify the dimensions of the problem in the text [1], [4]-[6], [9], [11]-[16].
The stemming technique [1], [5]-[7], [9]-[12] has the advantage of reducing words to their base forms
(removing affixes). This step reduces the analysis effort spent on unnecessary text.

Several classification algorithms show good results in sentiment analysis. These
algorithms include NB [17], KNN [9], [12], [17], SVM [4], [6], [9]-[16], [18], and DL [19]-[21].
Shaban et al. [22] obtained an accuracy of 98% with NB classification.
Kumari et al. [23] used SVM and obtained 90% accuracy. Romadhon and Kurniawan [17]
applied KNN and obtained 75% accuracy. To simplify the dimensions of the analysed text, FM [3],
PCA [8], [24], [25], and EDS [7] have been used.

SVM has been used in research on several application areas, such as presidential elections [12], health
record data [18], customer satisfaction [16], fake consumer reviews [6], restaurant reviews [9], flood
disaster news [14], Vietnamese text [15], and product reviews [4], [11]. The accuracy reported in research
using the SVM algorithm is quite high: 76.5% for the presidential election system [12], 84.85% for the
customer satisfaction system [18], and 88.13% for product reviews [11]. Various datasets are used,
including Twitter [10], [12], [14], [16], [18], Cornell University, TripAdvisor [6], restaurant reviews [9],
Vietnamese text [15], product reviews [4], application comments, financial market news [10], and Amazon
product reviews [11]. SVM has the advantage of relying on support vectors and a separating hyperplane, so
it takes little time to classify [12]. SVM with basic pre-processing can significantly improve accuracy
[18]. SVM works well on large data, although classification takes longer [16]. Combining SVM with n-grams
improves accuracy [16]. The number and length of the extracted word segments have a major influence on the
performance of the classifier, and dividing the text into bigrams and trigrams gives a sufficient number of
word segments [6]. SVM is suitable for, and produces good accuracy on, various types of datasets [13]. SVM
also performs well in non-linear classification [4]. It gives the best results across several types of
data sets and across multiple pre-processing combinations, and it is recommended as the best algorithm for
sentiment analysis [10]. The main objective of this study is to present the optimal combination of
techniques and algorithms, and with it the best accuracy value in sentiment analysis. We introduce a
modified model to achieve this goal by using a combination of SVM and FM [3].
2. METHOD

Generating a dataset that is free from noise and achieving the best accuracy values are challenges in
the sentiment analysis domain. The proposed model consists of 6 stages. First, data are collected from
3 consumer review datasets. Second, combinations of pre-processing techniques are applied to the
collected datasets. Third, the training data and testing data are determined. Fourth, algorithms are
applied to reduce the dimensionality of the data. Fifth, the data are classified with several algorithms.
Finally, the results are measured, the algorithms are compared with one another, and the proposed model is
compared with benchmark models. The following subsections describe each step of the method.


2.1. Collecting dataset 

This study uses 3 customer review datasets from different sources, namely Amazon product
review (dataset A), movie review (dataset B), and airline review from Twitter (dataset C). These datasets
come from the studies used as benchmarks [11], [26]. Data were taken from the Kaggle website [27] and the
Amazon dataset [28]. Each review is labeled with a positive or negative polarity. The dataset files were
then merged into one file and processed with a spreadsheet application (Microsoft Excel).


2.2. Data pre-processing configurations 
Data pre-processing is a series of procedures to clean up unnecessary data before the
data is used in the analysis process. The techniques used are basic data cleaning techniques, namely
tokenizing, stop words elimination, transform case, and stemming [1].

- Tokenizing: the tokenizing technique breaks text into smaller parts (referred to as tokens),
turning the content into meaningful data while retaining the text in sentences [1], [11]. Tokens
are obtained by separating text at spaces, punctuation, or line breaks [9].

- Stop words elimination: stop words are words that are ignored in processing and are usually stored in stop
lists. Stop lists are lists of common words that have a function but no meaning. Stop words are usually
chosen for their high frequency of occurrence, for example connecting
words such as "and", "or", "but" and "will" [12], [29].

- Transform case: this basic technique converts the entire text into lowercase
or uppercase. Words that then become identical are combined into one, so this
technique can reduce the dimensionality of the dataset to be analyzed [9], [15].

- Stemming: the stemming technique converts words into their base forms by removing the
suffix attached to the word [1], [9].
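
The four techniques above can be chained into a single pipeline. The following is a minimal Python sketch under stated assumptions, not the implementation used in this study: the stop list is a deliberately tiny hypothetical one, and the `stem` function is a toy suffix stripper standing in for a real stemmer such as Porter's algorithm. All function names are illustrative.

```python
import re

# Hypothetical, deliberately tiny stop list; real stop lists are much longer.
STOP_WORDS = {"and", "or", "but", "will", "the", "a", "is"}

# Toy suffix list standing in for a real stemmer such as Porter's algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text on any run of characters that is not a letter, digit, or apostrophe."""
    return [t for t in re.split(r"[^A-Za-z0-9']+", text) if t]


def lowercase(tokens):
    """Transform case: map every token to lowercase."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, lowercase, remove stop words, then stem."""
    tokens = lowercase(tokenize(text))
    tokens = remove_stop_words(tokens)
    return [stem(t) for t in tokens]


print(preprocess("The battery LASTED long and the screen is Amazing"))
# ['battery', 'last', 'long', 'screen', 'amaz']
```

Lowercasing is applied before stop word removal so that "The" and "the" are matched by the same stop list entry.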


2.3. Data determination 

The experiments use a training data to testing data composition of
80:20. The three datasets contain roughly the same number of reviews, but dataset B has the greatest
number of words because it consists of movie audience reviews. The composition of the datasets is
shown in Table 1.


Table 1. Composition of dataset

             Data training                        Data testing
Dataset      #Pos    #Neg    #Total  #Words       #Pos    #Neg    #Total  #Words
Dataset A    4,064   3,952   8,016   660,192      1,016   988     2,004   165,048
Dataset B    3,824   4,176   8,000   1,899,664    956     1,044   2,000   474,916
Dataset C    3,216   4,960   8,176   138,736      804     1,240   2,044   34,684
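
The training and testing counts in Table 1 correspond to 80% and 20% of each dataset (for example, 8,016 and 2,004 of the 10,020 reviews in dataset A). The sketch below shows one way such a split could be produced in Python. The function name, the fixed seed, and the synthetic records are assumptions for illustration; the source does not state whether the split was random.

```python
import random


def split_train_test(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Synthetic stand-in for dataset A, which has 8,016 + 2,004 = 10,020 reviews.
reviews = [("review %d" % i, "pos" if i % 2 == 0 else "neg") for i in range(10020)]
train, test = split_train_test(reviews)
print(len(train), len(test))  # 8016 2004
```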


2.4. Dimensionality reduction 

The challenge in pre-processing data is how to reduce the noise and dimensionality of the data in order
to speed up the analysis process or improve analysis performance. Performance can be improved
by applying pre-processing techniques and by reducing the dimensions of the data to be processed
[7], [9], [11], [13]. Several algorithms are applied to see how they affect the classification results of
sentiment analysis, namely term frequency-inverse document frequency (TF-IDF), PCA, FM, and EDS.


2.4.1. The term frequency-inverse document frequency 

TF-IDF is used here as a dimensionality reduction technique that assigns a value
to each word in the training data. The value is calculated to indicate how important a word is in
representing a sentence. The TF-IDF value depends on how frequently a word appears in the
document. TF gives a word importance in proportion to its total occurrence in the text
or document. IDF is a token weighting method that tracks the occurrence of tokens across a text
set. In extraction with TF-IDF, the weight (w) of each keyword is calculated for each document [11].
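
To make the weighting concrete, the sketch below computes TF-IDF weights for three toy documents in Python. It assumes one common variant, tf = count / document length and idf = log(N / df); the source does not give the exact formula, so this choice and the name `tf_idf` are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / document length and idf = log(N / df). This is one
    common variant and an assumption, not the formula from the paper.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["good", "phone"], ["bad", "phone"], ["good", "good", "movie"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

With this variant, a term that occurs in every document receives a weight of zero, which is how uninformative words are suppressed.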


2.4.2. Principal component analysis 

PCA is an algorithm for reducing dimensions or variables by transforming a set of correlated dimensions
into uncorrelated ones. The algorithm produces values called principal components (PCs). Each PC
is a linear combination of the original values before reduction. The PCs are obtained by projecting the
data vectors onto the space defined by the eigenvectors, through the following calculations: i) calculate the
covariance matrix; ii) find the eigenvalues and eigenvectors; iii) compute the reduction percentage; and
iv) compute the PCs [8], [24], [25].
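
The four calculations can be written out as follows. This is the standard textbook formulation, given here as a sketch: the symbols ($x_i$ for the $n$ data vectors of dimension $d$, $k$ for the number of retained components) are introduced for illustration, and reading the "reduction percentage" as the share of variance retained, $r_k$, is an assumption.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, &
C &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})(x_i-\bar{x})^{\top}, \\
C\,v_j &= \lambda_j v_j, &
\lambda_1 &\ge \lambda_2 \ge \dots \ge \lambda_d \ge 0, \\
r_k &= \frac{\sum_{j=1}^{k}\lambda_j}{\sum_{j=1}^{d}\lambda_j}, &
z_i &= V_k^{\top}(x_i-\bar{x}), \qquad V_k = [\,v_1, \dots, v_k\,].
\end{aligned}
```

Here $C$ is the covariance matrix, $(\lambda_j, v_j)$ are its eigenvalue and eigenvector pairs, and $z_i$ holds the first $k$ principal components of $x_i$.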


2.4.3. Fuzzy matching 

FM joins two sets of examples in a fuzzy way based on two nominal attributes. This means it matches
examples that are not necessarily equal but similar. A similarity is calculated between the two chosen
attributes, and the operator merges the k most similar examples from both sides. The similarity method is
set through the 'distance measure' parameter, and all similarity measures are currently based on the
Levenshtein distance. The Levenshtein distance defines the distance between two strings as the number of
changes needed to turn one string into the other [3].
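
The Levenshtein distance can be computed with the standard dynamic programming recurrence. The Python sketch below is illustrative only and is not the operator used in this study; `most_similar` is a hypothetical helper that mimics picking the k closest matches.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def most_similar(word, vocabulary, k=1):
    """Return the k vocabulary entries closest to word by edit distance."""
    return sorted(vocabulary, key=lambda v: levenshtein(word, v))[:k]


print(levenshtein("kitten", "sitting"))  # 3
print(most_similar("excelent", ["excellent", "expensive", "except"]))  # ['excellent']
```

Mapping a misspelled token such as "excelent" onto its closest vocabulary entry is what lets near-duplicate words collapse into a single dimension.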


2.4.4. Euclidean distance similarity

Euclidean distance measures how close vectors are to one another in a vector space.
It is related to the Pythagorean theorem and is usually illustrated in 1, 2, and 3 dimensions, but it
extends directly to higher dimensions. Because the distance between two vectors can be defined in more
than one way, the measure used must be stated explicitly; here it is the Euclidean distance [7].
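
As a short illustration, the Python sketch below computes the Euclidean distance between two vectors of any dimension. The conversion from distance to a similarity score, 1 / (1 + d), is one common convention and an assumption here; the source does not specify how EDS turns distance into similarity.

```python
import math


def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def euclidean_similarity(u, v):
    """Map distance to (0, 1]; 1 / (1 + d) is an assumed convention."""
    return 1.0 / (1.0 + euclidean_distance(u, v))


print(euclidean_distance([0, 0], [3, 4]))          # 5.0
print(euclidean_similarity([1, 2, 3], [1, 2, 3]))  # 1.0
```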


2.5. Sentiment classification 

Sentiment classification is a branch of text mining and can be important in
evaluating a topic of concern. Its main purpose is to determine whether a
sentiment is positive, negative, or neutral. Based on previous research [11], [17], [30],
SVM is compared against comparable classifiers, namely NB, KNN, and DL [19].


2.5.1. Naive Bayes 

The NB classifier is a supervised method that uses conditional probabilities to assign class labels to
instances and records, so that new objects can be categorized. Conditional probability is the likelihood
of an event occurring given that other events are assumed, presumed, stated, or confirmed to
have occurred.
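
For a document made of words $w_1, \dots, w_n$, the decision rule of a multinomial NB classifier can be sketched as below. The add-one (Laplace) smoothing term and the vocabulary symbol $V$ are assumptions introduced for illustration; the source does not state which NB variant was used.

```latex
\hat{c} \;=\; \arg\max_{c \,\in\, \{\text{pos},\,\text{neg}\}} \; P(c)\prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w \mid c) \;=\; \frac{\operatorname{count}(w, c) + 1}{\sum_{w' \in V} \operatorname{count}(w', c) + |V|}
```

Here $P(c)$ is the share of training documents in class $c$ and $\operatorname{count}(w, c)$ is the number of times word $w$ occurs in training documents of class $c$.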


2.5.2. K-nearest neighbor 

The kNN algorithm is a classification algorithm that takes the K closest data points
(neighbors) as a reference to determine the class of new data. This algorithm classifies data based on
similarity or proximity to other data [12], [17]. In general, the kNN algorithm works
as follows: i) determine the number of neighbors (K) that will be used to decide the class,
ii) calculate the distance from the new data to each data point in the dataset, and iii) take the K data
points with the closest distance, then determine the class of the new data from them.
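
The three steps can be sketched in a few lines of Python. The sketch assumes Euclidean distance and a simple majority vote, neither of which is specified in the source, and the two-dimensional vectors are made-up toy data.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training vectors."""
    # Step ii: distance from the query to every training point.
    distances = [
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    # Step iii: keep the k closest points and vote on the class.
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors and labels, for illustration only.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
y = ["pos", "pos", "neg", "neg"]
print(knn_predict(X, y, [0.8, 0.2], k=3))  # pos
```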


2.5.3. Deep learning 

DL is built on a multi-layer feed-forward artificial neural network trained with
back-propagation and stochastic gradient descent. DL architectures have already been employed in a
variety of applications, such as computer vision, pattern recognition, and NLP. They provide the ability
to learn multi-level representations of the dimensions, using learning
models based on numerous layers of hierarchical nonlinear information processing [20].


2.6. Measurement and evaluation 

An effective pre-processing method represents the documents accurately and efficiently in both space
and time while maintaining high retrieval performance. In this study, four metrics were employed, namely
accuracy, precision, recall, and f-measure. These metrics serve to determine the performance of a
sentiment classifier.
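
These metrics, together with the f-measure reported in Table 6, follow from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The Python sketch below uses the standard definitions; the confusion counts in the example are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and f-measure (in percent) from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    values = [
        ("accuracy", accuracy),
        ("precision", precision),
        ("recall", recall),
        ("f_measure", f_measure),
    ]
    return {name: round(100 * value, 2) for name, value in values}


# Hypothetical confusion counts, not taken from the paper's experiments.
print(classification_metrics(tp=45, fp=5, tn=40, fn=10))
# {'accuracy': 85.0, 'precision': 90.0, 'recall': 81.82, 'f_measure': 85.71}
```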


3. RESULTS AND DISCUSSION 
The datasets used in this study are Amazon product reviews (dataset A), movie reviews
(dataset B), and airline reviews from Twitter (dataset C). The reviews written by consumers take various
forms. In datasets A and B, the sentences are quite long compared to dataset C. Dataset C (Twitter)
contains short reviews because Twitter restricts the number of characters in a post.

To see the performance of each pre-processing technique, measurements were made on
several combinations of techniques: i) comparing classification results
without pre-processing and with all pre-processing techniques; ii) comparing the performance of
individual pre-processing techniques; and iii) comparing the performance of combinations of several
pre-processing techniques. The experiments were carried out on the 3 prepared datasets, and accuracy
results were compared using 4 classifiers: NB, KNN, SVM, and DL. The first experiment compares
classification without pre-processing and with all pre-processing techniques on the three datasets and
four classifiers. Applying the pre-processing techniques increases classification accuracy, as shown in
Table 2.


Table 2. Accuracy (%) with and without pre-processing techniques

                     Dataset A            Dataset B            Dataset C
                     NB  kNN  SVM  DL     NB  kNN  SVM  DL     NB     kNN    SVM    DL
No pre-processing    50  51   51   51     48  52   52   52     39.22  39.22  61.76  38.24
All pre-processing   71  69   78   55     62  66   75   53     72.55  76.47  79.41  73.53


The SVM algorithm achieves the highest accuracy, or ties for it, both before and after applying the
pre-processing techniques. With all pre-processing techniques applied, the highest accuracy is obtained by
SVM on dataset C, at 79.41%. The largest increase is shown by the KNN algorithm
on dataset C, where accuracy rises by 37.25 percentage points (from 39.22% to 76.47%), as shown in Figure 1.
Figure 1. Accuracy of application of classification algorithm


The performance of each pre-processing technique was measured in the next experiment. The
tokenizing technique has a significant effect on accuracy, as shown in Table 3. With
tokenizing alone, the highest accuracy on dataset C is obtained by the KNN algorithm, at 77.45%,
an increase of 38.23 percentage points over the 39.22% obtained without pre-processing.


Table 3. Tokenization technique performance (accuracy, %)

                     Dataset A            Dataset B            Dataset C
                     NB  kNN  SVM  DL     NB  kNN  SVM  DL     NB     kNN    SVM    DL
No pre-processing    50  51   51   51     48  52   52   52     39.22  39.22  61.76  38.24
Tokenizing           75  64   74   59     70  63   71   53     71.57  77.45  75.49  65.69


Table 4 shows that the lowercasing, stop word elimination, and stemming techniques have no
significant effect when applied individually. In dataset A with the DL algorithm, all three techniques
actually lower the accuracy (from 51% to 49%), and in dataset B lowercasing lowers the DL accuracy from
52% to 48%. For dataset C with the SVM algorithm, only lowercasing and stemming increase the accuracy
(from 61.76% to 62.75%), and the increase is small. For dataset C with the DL algorithm, lowercasing,
stop word elimination, and stemming all improve accuracy, with the largest increase obtained from
lowercasing (from 38.24% to 62.75%).
The next experiment combines pre-processing techniques. The combinations are as
follows: i) tokenizing+lowercasing; ii) tokenizing+stop word elimination; and iii) tokenizing+lowercasing+
stemming. Based on the experimental results, the highest accuracy on all three datasets was obtained with
the SVM algorithm. As seen in Table 5, in dataset A the highest accuracy is obtained with the
combination of tokenizing+lowercasing, while in datasets B and C the highest accuracy is obtained
with the combination of tokenizing+lowercasing+stemming.


Table 4. Lowercasing, stop word elimination, and stemming performance (accuracy, %)

                        Dataset A            Dataset B            Dataset C
                        NB  kNN  SVM  DL     NB  kNN  SVM  DL     NB     kNN    SVM    DL
No pre-processing       50  51   51   51     48  52   52   52     39.22  39.22  61.76  38.24
Lowercasing             50  51   51   49     48  52   52   48     39.22  39.22  62.75  62.75
Stop word elimination   50  51   51   49     48  52   52   52     39.22  39.22  61.76  50
Stemming                50  51   51   49     48  52   52   52     39.22  39.22  62.75  39.22


Table 5. Combination of text pre-processing techniques (accuracy, %)

                                   Dataset A            Dataset B            Dataset C
                                   NB  kNN  SVM  DL     NB  kNN  SVM  DL     NB     kNN    SVM    DL
Tokenizing+lowercasing             75  71   80   55     67  69   73   52     73.53  77.45  79.41  62.75
Tokenizing+stop word elimination   72  70   79   55     69  55   73   53     68.63  76.47  77.45  62.75
Tokenizing+lowercasing+stemming    71  70   77   54     61  66   76   53     72.55  76.47  81.37  75.49


The classification algorithm that shows the best results is SVM, while the best combination of
pre-processing techniques is tokenizing, lowercasing, and stemming. This combination is then tested with
each of the dimensionality reduction algorithms, namely TF-IDF, PCA, FM, and EDS. The experimental
results in Figures 2(a)-(c) show that the FM algorithm gives the best results, raising the accuracy
to 96% on all 3 datasets. Figure 2(a) shows that FM produces the highest accuracy on dataset A,
while PCA shows the lowest. The use of FM on dataset A increased accuracy by 16 percentage points (from
80% to 96%). Likewise, Figure 2(b) shows that on dataset B, FM produces the highest accuracy compared with
TF-IDF, PCA, and EDS, an increase of 20 percentage points (from 76% to 96%). As with datasets A and B,
Figure 2(c) shows that FM produces the highest accuracy on dataset C, an increase of 14.63 percentage
points (from 81.37% to 96%).
Figure 2. Performance comparison of dimensionality reduction algorithms for (a) dataset A, (b) dataset B,
and (c) dataset C


At the evaluation stage, we ran a series of experiments using the tokenizing,
lowercasing, and stemming pre-processing techniques. The SVM classification algorithm was combined with
each of the TF-IDF, PCA, FM, and EDS algorithms individually. The proposed model achieves better
performance than the benchmark model. For the benchmark model [11], the highest accuracy obtained
is 88.75%, while for the proposed model the accuracy increased to 96%. Figure 3 shows that the
proposed model has a higher accuracy than the benchmark model on all 3 datasets. Table 6 shows the
complete accuracy, precision, recall, and f-measure values.
Figure 3. Performance of proposed model


Table 6. Measurement of proposed model (%)

Model                    Dataset     Accuracy  Precision  Recall  F-measure
SVM+TF-IDF (benchmark)   Dataset A   88.75     87.18      94.56   90.72
                         Dataset B   84        78.08      85.42   81.59
                         Dataset C   88.49     90.61      83.87   87.11
SVM+FM (proposed)        Dataset A   96        95.65      95.65   95.65
                         Dataset B   96        100        92.45   96.08
                         Dataset C   96        91.49      100     95.56
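
The f-measure column in Table 6 can be reproduced from the precision and recall columns, since the f-measure is the harmonic mean of the two. The short Python check below uses only the values listed in Table 6; the function and variable names are illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f-measure) as listed in Table 6.
table_6 = {
    "SVM+TF-IDF, dataset A": (87.18, 94.56, 90.72),
    "SVM+TF-IDF, dataset B": (78.08, 85.42, 81.59),
    "SVM+TF-IDF, dataset C": (90.61, 83.87, 87.11),
    "SVM+FM, dataset A": (95.65, 95.65, 95.65),
    "SVM+FM, dataset B": (100, 92.45, 96.08),
    "SVM+FM, dataset C": (91.49, 100, 95.56),
}

for name, (precision, recall, reported) in table_6.items():
    print(name, round(f_measure(precision, recall), 2), reported)
```

Each computed value agrees with the reported one to two decimal places.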


4. CONCLUSION

This paper proposes a combination of SVM with dimensionality reduction techniques for sentiment
analysis classification. To improve the performance of SVM, it is combined with dimensionality reduction
techniques, namely the TF-IDF, PCA, FM, and EDS algorithms. Prior to that, data pre-processing techniques
are applied to clean the data, which include tokenizing, stop words elimination, transform case, and
stemming. The experiments demonstrate that SVM-FM produces the best result, with
an accuracy of 96% on all three datasets. Therefore, among the combinations tested, SVM-FM can be used
to produce the best sentiment analysis performance.